
Abstract 

Images are two dimensional projections of three dimensional
scenes; depth recovery is therefore a crucial problem in Image
Understanding, with applications in passive navigation, cartography,
surveillance, and industrial robotics. Stereo analysis provides a
more direct quantitative depth evaluation than techniques such as
shape from shading, and its being passive makes it more applicable
than active range finding imagery by laser or radar. This paper
addresses the subproblem of identifying corresponding points in the
two images. The primitives we are using are groups of collinear
connected edge points called segments, and we base the
correspondence on the minimum "differential disparity" criterion.
The result of this processing is a sparse array disparity map of the
analyzed scene.

I.  Introduction 

The human visual system perceives depth with no apparent effort and
very few mistakes, but how it does so is not understood. Binocular
stereopsis plays a key role in this process, and the straightforward
extraction of depth it provides, once corresponding points are
identified, makes it very attractive. Depth recovery is necessary in
domains such as passive navigation [Gennery80, Moravec80],
cartography [Kelly77, Panton78], surveillance [Henderson79] and
industrial robotics. Proposed solutions for the stereo problem follow
a paradigm involving the following steps [Barnard82]:

-image  acquisition, 

-camera  modeling, 

-feature  acquisition, 

-image  matching, 

-depth  determination, 

-interpolation.

The hardest step is image matching, that is, identifying
corresponding points in two images, and this paper is solely devoted
to it. The next section reviews existing systems, divided into two
broad classes, area-based and edge-based; we then summarize our
assumptions and give a formal description of the method. The fourth
section presents results, and we then discuss extensions.


II.  Review  of  existing  methods 

Two  classes  of  techniques  have  been  used  for 
stereo  matching,  area-based  and  feature-based. 


2.1.  Area-based  stereo 

Ideally, one would like to find a corresponding pixel for each pixel
in each image of a stereo pair, but the semantic information conveyed
by a single pixel is too low to resolve ambiguous matches. We
therefore have to consider an area or neighborhood around each pixel,
and use correlation-based matching algorithms to determine the
corresponding match; local context is thus used to resolve
ambiguities. The justification for such an approach is that of
"continuity", that is, disparity values change smoothly, except at a
few depth discontinuities. All systems based on area-correlation
suffer from the same limitations:

- They require the presence of a detectable
texture within each correlation window,
therefore they tend to fail in featureless
or repetitive texture environments.

- They tend to be confused by the presence
of a surface discontinuity in a correlation
window.

- They are sensitive to absolute intensity,
contrast and illumination.

- They get confused in rapidly changing
depth fields (vegetation).

For these reasons, the existing systems, especially the ones used in
"automatic" cartography, require the intervention of human operators
to guide them and correct them. Such systems are described in
[Lucas81, Panton78, Hannah80, Barnard80, Moravec79].
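
As an illustration of the area-based approach, the following minimal sketch scores candidate positions along one image row with a normalized cross-correlation and keeps the best one. It is not taken from any of the cited systems: the function names, the square window, the toy images, and the assumption that epipolar lines coincide with image rows are all choices made for this example. Note how a featureless window yields a zero score, which is the first limitation listed above.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # featureless window: there is no texture to correlate
    return sum(x * y for x, y in zip(da, db)) / denom

def window(img, row, col, half):
    """Flatten the (2*half+1) x (2*half+1) neighborhood centered on (row, col)."""
    return [img[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]

def best_disparity(left, right, row, col, half, max_d):
    """Search the same row of `right` for the window that best matches `left`."""
    ref = window(left, row, col, half)
    width = len(right[0])
    best_d, best_score = None, -2.0
    for d in range(-max_d, max_d + 1):
        c = col + d
        if c - half < 0 or c + half >= width:
            continue  # candidate window would leave the image
        score = ncc(ref, window(right, row, c, half))
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score

# Toy pair (hypothetical data): the right image is the left one shifted
# two columns to the right.
pattern = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
left_img = [pattern[:] for _ in range(3)]
right_img = [[0, 0] + pattern[:-2] for _ in range(3)]
disparity, score = best_disparity(left_img, right_img, row=1, col=5, half=1, max_d=3)
```

On this toy pair the search recovers the two-column shift; on real imagery the window size and the search range would have to be tuned.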

2.2. Feature-based systems

The depth information in stereo analysis is conveyed by the
differences in the two images of a stereo pair due to the different
viewpoints, the differences being most prominent at the
discontinuities, or edges. Obviously, matching of features will not
provide a full depth map, and must be followed by an interpolating
scheme. The common characteristics of feature-based matching
techniques are:

- They are faster than area-based methods,
because there are many fewer points to
consider.

- The obtained match is more accurate;
edges can even be located with sub-pixel
precision [Binford81].

- They are less sensitive to photometric
variations, since they represent
geometric properties of a scene.

Henderson [Henderson79] considered scenes representing cultural
sites (man-made structures) and matched edge points on epipolar lines
in the two views. He reduced ambiguity by assuming continuity between
consecutive epipolar lines. Marr and Poggio have relied on two
apparently simple constraints [Marr79]:

1. Uniqueness.

Each point in an image may be assigned
at most one disparity value. One may
note that this assumption is not correct
for transparent objects.

2. Continuity.

Matter is cohesive, therefore disparity
values change smoothly, except at a few
depth discontinuities.

They first proposed a cooperative algorithm [Marr76] that works very
well on random-dot stereograms, but they abandoned it in favor of one
of a more heuristic nature, implemented by Grimson [Grimson79,
Grimson81], that generates good results, given the very few
assumptions. Arnold [Arnold78] matches edges using local context, and
his system seems to perform well on cultural scenes. Finally, Baker
and Binford [Baker82] match edges on epipolar lines by using the
no-reversal constraint that the order of the match has to be
preserved, in addition to uniqueness and continuity. They also
consider continuity by examining adjacent epipolar lines. This system
appears to perform reasonably on a wide variety of images.
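
The no-reversal constraint mentioned above can be stated compactly in code. The sketch below is only an illustration and is not drawn from [Baker82]; it assumes matches on one epipolar line are given as hypothetical (left column, right column) pairs and reports whether any two of them cross.

```python
def violates_ordering(matches):
    """Return True if the left-to-right order of edges is not preserved.

    `matches` is a list of (x_left, x_right) column pairs found on a single
    epipolar line (a hypothetical representation chosen for this sketch).
    """
    ordered = sorted(matches)                 # sort by column in the left view
    rights = [x_right for _, x_right in ordered]
    # Order is preserved when the right-view columns increase as well.
    return any(b <= a for a, b in zip(rights, rights[1:]))
```

For instance, the pairs (10, 7), (20, 15), (30, 28) keep their order, while (10, 15), (20, 7) cross each other and would be rejected.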

In most of the systems presented above, a considerable saving in
search time is obtained by a coarse to fine matching, that is, the
matching is originally done on a low-resolution version of the image
and the results are propagated to the higher resolution version.
However, it should be noted that in current implementations, good
matches as well as errors tend to propagate from one level to the
next.


III.  The  Minimal  Differential 
Disparity  Algorithm 

From the survey conducted above, it appears that feature-based
techniques are more appropriate to solve the correspondence problem,
but edges as a primitive seem to be too low-level, and a connectivity
check is needed to remove spurious matches. High level primitives
such as physical object boundaries or surface descriptions would be
preferred; however, stereo processing may need to precede the
computation of such descriptions. As a step towards higher level
primitives, we are using segments. In order to generate them, we fit
straight lines through adjacent edge points with a given tolerance of
one pixel. These segments can be described by:

- coordinates  of  the  end  points 

- orientation 

- strength  (average  contrast) 
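
A minimal sketch of how such segments might be produced and stored is given below. The text does not specify the fitting procedure, so the greedy chord-growing strategy, the `Segment` record and the function names are assumptions made for illustration; only the one-pixel tolerance and the three descriptors come from the description above.

```python
import math
from dataclasses import dataclass

@dataclass
class Segment:
    x1: float
    y1: float
    x2: float
    y2: float
    orientation: float  # degrees in [0, 180)
    strength: float     # average contrast of the member edge points

def _distance_to_chord(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length

def fit_segments(points, contrasts, tolerance=1.0):
    """Greedily split a chain of linked edge points into straight segments."""
    segments = []
    start = 0
    while start < len(points) - 1:
        end = start + 1
        # Extend while every intermediate point stays within tolerance of the chord.
        while end + 1 < len(points) and all(
            _distance_to_chord(points[m], points[start], points[end + 1]) <= tolerance
            for m in range(start + 1, end + 1)
        ):
            end += 1
        (x1, y1), (x2, y2) = points[start], points[end]
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        strength = sum(contrasts[start:end + 1]) / (end - start + 1)
        segments.append(Segment(x1, y1, x2, y2, angle, strength))
        start = end
    return segments

# Hypothetical L-shaped chain: a horizontal arm followed by a vertical arm.
chain = [(x, 0) for x in range(11)] + [(10, y) for y in range(1, 11)]
contrast = [10.0] * 11 + [20.0] * 10
segs = fit_segments(chain, contrast)
```

With a one-pixel tolerance the greedy fit lets the first segment run one point past the corner before breaking, which is one reason a real implementation might prefer a recursive split at the point of maximum deviation.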

By using these primitives, we implicitly assume the connectivity
constraint. When matching segments, we need to allow one segment to
possibly match with more than one segment in the other image (i.e. to
allow for fragmented segments), even if we wish to preserve unique
matches for the individual edge points. Also, instead of considering
one epipolar line at a time, we have to consider all epipolar lines
in which a given segment appears.


3.1.  Assumptions  and  Definitions 

We consider a simple camera geometry in which the epipolar plane,
defined as the plane passing through an object point and the two
camera foci, intersects the two image planes, so defining epipolar
lines parallel to the y axis. Therefore, corresponding points must
lie on corresponding epipolar lines, that is, have the same row
value. This is illustrated in Figure 3-1.


Figure  3-1:  Collinear  Epipolar  Geometry 

from [Baker82]

We also give a bound on the disparity range allowable for any given
segment; let us call it maxd.

Let A={a_i} be the set of segments in the left image.
Let B={b_j} be the set of segments in the right image.

Then, for each segment a_i (resp. b_j) in the left (resp. right)
image, we can define a window w(i) (resp. w(j)) in which
corresponding segments from the right (resp. left) image must lie.
The shape of this window is a parallelogram, one median being a_i
(resp. b_j), the other a horizontal vector of length 2*maxd. One can
see that a_i in w(j) implies b_j in w(i).

We define the boolean function p(i,j) relating two segments as:

p(i,j) is true if

- b_j overlaps w(i)

- a_i, b_j have "similar" contrast

- a_i, b_j have "similar" orientation


The  required  similarity  in  orientation  is  loose  and 
is  a function  of  the  segment  length.  We  have  set 
it  to  be  25  degrees  for  long  segments  and  up  to  90 
degrees  for  very  short  segments. 

Two segments are defined to have similar contrast if the absolute
value of the difference of the individual contrasts is less than 20%
of the larger one.
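
The three tests that make up p(i,j) can be sketched as follows. Several details are assumptions made only for this illustration: corresponding points are taken to share a row, orientations are treated as directed angles in degrees, the tolerance falls linearly from 90 to 25 degrees between two hypothetical lengths (`short_len`, `long_len`), and the length used is that of the shorter segment. The 20% contrast rule and the 2*maxd window width follow the text.

```python
import math
from typing import NamedTuple

class Seg(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float
    orientation: float  # degrees, treated as a directed angle (assumption)
    contrast: float     # average contrast along the segment

def column_at(s, row):
    """Column of segment s at a given row (midpoint for a horizontal segment)."""
    if s.y2 == s.y1:
        return (s.x1 + s.x2) / 2.0
    return s.x1 + (s.x2 - s.x1) * (row - s.y1) / (s.y2 - s.y1)

def overlaps_window(a, b, maxd):
    """True if b enters the parallelogram of half-width maxd built around a."""
    lo = max(min(a.y1, a.y2), min(b.y1, b.y2))
    hi = min(max(a.y1, a.y2), max(b.y1, b.y2))
    if lo > hi:
        return False  # the segments share no row
    off_lo = column_at(b, lo) - column_at(a, lo)
    off_hi = column_at(b, hi) - column_at(a, hi)
    # The offset is linear in the row, so it is smallest at an end or at a sign change.
    return off_lo * off_hi <= 0 or min(abs(off_lo), abs(off_hi)) <= maxd

def similar_contrast(a, b):
    return abs(a.contrast - b.contrast) < 0.20 * max(abs(a.contrast), abs(b.contrast))

def orientation_tolerance(length, short_len=5.0, long_len=30.0):
    """25 degrees for long segments, up to 90 for very short ones (hypothetical ramp)."""
    if length >= long_len:
        return 25.0
    if length <= short_len:
        return 90.0
    t = (length - short_len) / (long_len - short_len)
    return 90.0 - t * 65.0

def similar_orientation(a, b):
    diff = abs(a.orientation - b.orientation) % 360.0
    diff = min(diff, 360.0 - diff)
    shortest = min(math.hypot(a.x2 - a.x1, a.y2 - a.y1),
                   math.hypot(b.x2 - b.x1, b.y2 - b.y1))
    return diff <= orientation_tolerance(shortest)

def p(a, b, maxd):
    """Candidate-match predicate: window overlap, similar contrast and orientation."""
    return (overlaps_window(a, b, maxd)
            and similar_contrast(a, b)
            and similar_orientation(a, b))
```

As a usage example, a vertical segment at column 10 and another at column 14 with contrasts 50 and 45 satisfy `p` for maxd = 8, whereas moving the second one to column 40 or lowering its contrast to 30 makes `p` false.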

To each pair (i,j) such that p(i,j) is true we associate an average
disparity d_ij, which is the average of the disparity between the two
segments a_i and b_j along the length of their overlap.
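
Because both primitives are straight, the disparity varies linearly along the rows they share, so its average is simply the mean of the values at the two ends of the overlap. The sketch below assumes segments are given as (x1, y1, x2, y2) tuples, that corresponding points share a row, and that disparity is measured as right column minus left column; the sign convention and the function names are assumptions.

```python
def column_at(seg, row):
    """Column of a segment (x1, y1, x2, y2) at a given row."""
    x1, y1, x2, y2 = seg
    if y2 == y1:
        return (x1 + x2) / 2.0  # horizontal segment: fall back to its midpoint
    return x1 + (x2 - x1) * (row - y1) / (y2 - y1)

def average_disparity(a, b):
    """Mean of (column of b - column of a) over the rows shared by a and b."""
    lo = max(min(a[1], a[3]), min(b[1], b[3]))
    hi = min(max(a[1], a[3]), max(b[1], b[3]))
    if lo > hi:
        return None  # no common rows, so no disparity can be measured
    d_lo = column_at(b, lo) - column_at(a, lo)
    d_hi = column_at(b, hi) - column_at(a, hi)
    # The difference is linear in the row, so its average is the midpoint value.
    return (d_lo + d_hi) / 2.0
```

For a vertical left segment at column 10 spanning rows 0 to 20 and a right segment running from (14, 5) to (18, 25), the shared rows are 5 to 20 and the average disparity is 5.5.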

We define the two functions S1 and S2 as:

S1(a_i) = { j | b_j in w(i) and p(i,j) is true }

S2(a_i) = { j | b_j in w(i) and p(i,j) is false }

Similarly, we define S1(b_j) and S2(b_j). We will also need the value
card(a_i), which is the number of elements in the set
S1(a_i) ∪ S2(a_i).

It is to be noted that all the functions described above are static,
meaning that they are computed only once.


3.2. Description

Each possible match is evaluated by computing a measure of the
distortion this match provokes for its neighbors, i.e. given that
(i,j) is a correct match with its associated disparity d_ij, how well
do the neighbors agree with this proposed disparity? We compute an
evaluation of the match (i,j) and compare it to the matches (i,k) and
(h,j) for k in S1(a_i) and h in S1(b_j). If the evaluation is minimum
for (i,j), then j is the preferred interpretation for i and i is the
preferred interpretation for j. For any iteration after the first
one, in order to evaluate a match (i,j), we only look at the
preferred matches for the neighbors of i and j, if they have any.
Formally, the computation of v(i,j) is:

At iteration 1

\[
v^{1}(i,j) =
\sum_{a_h \in S_1(b_j) \cup S_2(b_j)} \;
\min_{b_k \in S_1(a_h)} \frac{|d_{hk} - d_{ij}|}{\mathrm{card}(b_j)}
\;+\;
\sum_{b_k \in S_1(a_i) \cup S_2(a_i)} \;
\min_{a_h \in S_1(b_k)} \frac{|d_{hk} - d_{ij}|}{\mathrm{card}(a_i)},
\qquad a_h \neq a_i,\; b_k \neq b_j
\]

At the end of each iteration, we define the sets Q(a_i) and Q(b_j) as

j in Q(a_i) and i in Q(b_j) if
for all k in S1(a_i), k ≠ j, v^t(i,j) < v^t(i,k)
AND
for all h in S1(b_j), h ≠ i, v^t(i,j) < v^t(h,j)

For any iteration after the first one, the computation of v^t(i,j)
becomes

\[
v^{t}(i,j) =
\sum_{a_h \in S_1(b_j) \cup S_2(b_j)} \;
\min_{b_k \in Q(a_h)} \frac{|d_{hk} - d_{ij}|}{\mathrm{card}(b_j)}
\;+\;
\sum_{b_k \in S_1(a_i) \cup S_2(a_i)} \;
\min_{a_h \in Q(b_k)} \frac{|d_{hk} - d_{ij}|}{\mathrm{card}(a_i)},
\qquad a_h \neq a_i,\; b_k \neq b_j
\]

if the sets Q are not empty, otherwise the computation of the
function v is done using the formula for iteration 1.


At the last iteration, only those elements that have a preferred
match are considered valid, and a disparity map array is filled using
these values. It is interesting to note that this process is
absolutely symmetric in the two views and therefore will yield
identical results (except for the sign of the disparity) if the two
views are interchanged. It is helpful to look at a simple example to
understand this process.


3.3. Example

Let  our  2 views  be  the  ones  shown  in  Figure 
3-2  below: 


Figure 3-2: A simple example



In absence of any extra information, the correct interpretation is
that the 3 points have the same disparity, and the result of the
matching is (a_i, b_i) for i in {1, 2, 3}.

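In this example, S1(a_i) = S1(b_j) = {1, 2, 3} and
S2(a_i) = S2(b_j) = ∅. The array d_ij is

\[
\begin{pmatrix}
 0 &  1 & 2 \\
-1 &  0 & 1 \\
-2 & -1 & 0
\end{pmatrix}
\]

Therefore we find

\[
\begin{aligned}
v^{1}(1,1) &= (|d_{22} - d_{11}| + |d_{33} - d_{11}|)/3 \\
           &\quad + (|d_{22} - d_{11}| + |d_{33} - d_{11}|)/3 \\
           &= 0
\end{aligned}
\]

compared to

\[
\begin{aligned}
v^{1}(1,2) &= (|d_{23} - d_{12}| + |d_{33} - d_{12}|)/3 \\
           &\quad + (|d_{21} - d_{12}| + |d_{23} - d_{12}|)/3 \\
           &= 1
\end{aligned}
\]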

and to

\[
\begin{aligned}
v^{1}(1,3) &= (|d_{??} - d_{13}| + |d_{??} - d_{13}|)/3 \\
           &\quad + (|d_{12} - d_{13}| + |d_{11} - d_{13}|)/3 \\
           &= 2.67
\end{aligned}
\]


The calculations are similar for the other pairs, so, at the end of
the first iteration, the preferred interpretations are only the
correct ones, and further iterations will not alter the results.
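
The example can be replayed with a short program. The sketch below is one reading of the evaluation, not the authors' implementation: every segment is taken to be a neighbor of every other (so card = 3), the pair under evaluation excludes its own two segments from the neighbors' candidates, and a neighbor with no usable preferred match falls back to all its candidates. Indices are 0-based, so the pair written (1,2) in the text is `(0, 1)` here. Under this reading the values for (1,1) and (1,2) come out as 0 and 1, while the value for the third pair is not asserted to equal the figure quoted above.

```python
def evaluate(d, i, j, cand_right, cand_left):
    """Evaluation v(i, j): how far the neighbors' best disparities are from d[i][j]."""
    n = len(d)
    ref = d[i][j]
    total = 0.0
    for h in range(n):                      # neighbors a_h of a_i
        if h == i:
            continue
        ks = [k for k in cand_right[h] if k != j] or [k for k in range(n) if k != j]
        total += min(abs(d[h][k] - ref) for k in ks)
    for k in range(n):                      # neighbors b_k of b_j
        if k == j:
            continue
        hs = [h for h in cand_left[k] if h != i] or [h for h in range(n) if h != i]
        total += min(abs(d[h][k] - ref) for h in hs)
    return total / n                        # card = n in this toy setting

def preferred(v):
    """Pairs whose evaluation is strictly smallest in both their row and column."""
    n = len(v)
    return [(i, j) for i in range(n) for j in range(n)
            if all(v[i][j] < v[i][k] for k in range(n) if k != j)
            and all(v[i][j] < v[h][j] for h in range(n) if h != i)]

def match(d, iterations):
    """Run the iterations and return a list of (evaluations, preferred pairs)."""
    n = len(d)
    cand_right = [list(range(n)) for _ in range(n)]   # iteration 1: all candidates
    cand_left = [list(range(n)) for _ in range(n)]
    history = []
    for _ in range(iterations):
        v = [[evaluate(d, i, j, cand_right, cand_left) for j in range(n)]
             for i in range(n)]
        pairs = preferred(v)
        history.append((v, pairs))
        # Later iterations only look at the preferred matches of the neighbors.
        cand_right = [[k for h, k in pairs if h == i] for i in range(n)]
        cand_left = [[h for h, k in pairs if k == j] for j in range(n)]
    return history

D = [[0, 1, 2],
     [-1, 0, 1],
     [-2, -1, 0]]
history = match(D, 3)
```

After the first iteration the preferred pairs are the three diagonal ones, and they are unchanged after the second and third iterations, in line with the remark that further iterations do not alter the result.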


3.4. Discussion

The criterion used here, namely the minimal differential disparity,
has similarities with the edge interval constraints given in
[Arnold80] and subsequently used by Baker [Baker82], but is looser in
the sense that it does not require ordering of the edges. Since our
criterion does not take ordering into account, a dynamic programming
implementation is not possible. Our evaluation function is more
informed than Baker's in the sense that it considers all edges in a
neighborhood instead of just the predecessor and successor of a given
edge. The performance of this algorithm on a few examples is
presented next.


IV.  Results 

It is difficult to display results of stereo matching meaningfully,
especially in a two dimensional picture, since we only generate a
sparse disparity map. We will simply show the line segments in the
two views that are found to match. We have not been able to master
the art of cross-eyed stereo fusion, but since a number of people in
the field are good at it, we will present all pairs of images
according to its convention, that is, the left view is shown on the
right and the right view on the left. All results will also be shown
this way, without explicitly marking each point and its
correspondence. We first started our experiments with very simple
line drawings, slightly more complex than the one shown in Figure
3-2, and the results matched the expectations. In order to remove the
effects of the segmentation procedure on the performance of our
matching technique, we hand-segmented the images shown in Figure 4-1
by tracing the boundaries of the objects on a digitizing table. This
image, from Control Data Corporation, is synthetic and has been used
by Baker [Baker82] for his experiments. The resulting segments are
shown in Figure 4-2, and Figure 4-3 displays the results after
matching. All the lines that have been matched have the correct
correspondence, but some matches are missed. This is due to the fact
that when the matcher gets confused by closely competing assignments,
it chooses not to assign a label. Also, some edges are not matched
because of mistakes in the tracing procedure: we traced the
boundaries of some objects in opposite directions in the two views.

For all other examples, edge detection was performed automatically
using a technique developed by Nevatia and Babu [Nevatia80] that
finds edge magnitude and direction by convolving the image with edge
masks in different orientations (we used 5x5 masks in 6 directions
here). These edges are then linked to form boundary curves which are
approximated by piecewise linear segments.

Next, consider the industrial part shown in Figure 4-4; the original
resolution is 256 by 256 and the gray levels are coded on 8 bits. We
applied the matching algorithm to two different resolutions of the
image, running it through three iterations. It was found that no
assignment was changed after three iterations in our experiments.
Figure 4-5 shows the original edges and Figure 4-6 displays the
results in the above mentioned form. Similarly, Figure 4-7 shows the
segments at half resolution and Figure 4-8 the results. Looking at
the segments one by one, we did not notice any spurious assignment at
either resolution, meaning that we captured the shape of the object,
even though the density of edges is much larger than in the previous
example.

Another, more complex image is shown in Figure 4-9. In this image,
we have a wide range of disparities, a change of sign in the
disparities across the picture, various occlusions, the presence of a
repetitive structure (a Rubik's cube) and contrast reversal. We do
not expect to get good results with this contrast reversal since one
of our preliminary conditions is similarity in contrast, but the
other peculiarities are very interesting. We worked at low resolution
on the segments shown in Figure 4-10 to obtain the results shown in
Figure 4-11. The interesting points are the following:

- The elongated vertical blocks in the rear
of the image are correctly put into
correspondence.

- All the squares of the cube that should
be identified are correctly matched. The
correct labeling appeared at iteration 2
(at iteration 1, most of them are only
ambiguously matched).

The segments at high resolution are shown in Figure 4-12 and the
matching results in Figure 4-13. We did not use the results at low
resolution to guide the matching at high resolution, therefore the
elongated block in the rear right is not matched any longer. It is
interesting to note that the edges coming from the texture of the
wood blocks do not create confusion, but help the matching, on the
front cylinder for example. Once again, most assigned matches are
correct.


V.  Conclusions 

This research is far from being in a final state. The initial
encouraging results presented here must therefore only be viewed as
an indication that the hypothesis of minimal differential disparity
may be useful. The critical points that must be examined are:

- Relax the contrast constraint. This may
be done by considering not the contrast
of an edge, but the intensity values on
each side. Edges could then be matched
if either their left side or their right
side correspond. One may eventually
consider an edge as a doublet [Baker82] and
match each side separately.

- Refine the formulation of the evaluation
formula. Statistical analysis may
yield better functions, maybe by
introducing a static probability measure to
evaluate each match based on similarity
of intrinsic properties (length, color,
orientation). Also of concern is a more
accurate definition of a no-match label,
which is obtained if a match pair is not
clearly better than the competing ones.

- Further extensive testing is also
required on aerial and near range imagery,
with terrain models for accuracy
checking.

- Finally, we must use an interpolation
scheme, very likely intensity-based, to
generate a full disparity map of the
scene depth.

Figure 4-1: Synthetic image [256x256x6]


Figure 4-2: Hand generated segments


Figure 4-7: Segments from the half resolution image


Figure 4-9: Image of some blocks [512x512x7]


Figure 4-10: Segments at low resolution


Figure 4-12: Segments at high resolution


Figure 4-13: Results at high resolution


